Avoid index clone in next #1584

Open

scho-furiosa wants to merge 3 commits into rust-ndarray:master from scho-furiosa:no-clone

Conversation


@scho-furiosa scho-furiosa commented Feb 10, 2026

Hello, everyone.

My colleague @chaehhyun found some performance regressions during profiling that we traced to a clone in next. We suspect ix.clone() inside Baseiter::next is the culprit, but we are not 100% sure yet; we are still experimenting with its impact internally.

I have a couple of questions:

  • I am curious if the cloning of index in next is intentional.
  • If not, is there any pre-existing perf test in CI?

Thank you.

There is an avoidable `clone` in `next`.

This commit introduces `next_for_mut`, which is similar to `next_for`,
but updates the key in-place, and uses it in `next` instead.
Collaborator

akern40 commented Feb 11, 2026

Hello! Sorry to hear about a regression. Can you provide a minimal working example of the regression? Or some additional debug information?

Collaborator

akern40 commented Feb 11, 2026

Also, what version caused the regression?

Author

Also, what version caused the regression?

We are using version 0.15.6. That said, I do not think this issue is related to the version, since the ix.clone() line has been there since the initial commit in 2014.

Can you provide a minimal working example of the regression? Or some additional debug information?

AFAIU, this can be problematic when there are many dimensions (probably more than 6, i.e. IxDyn), because cloning the key then becomes expensive. Basically, I just think cloning the key is unnecessary for normal iteration.

Re the minimal example, I am still curious whether there are perf tests in the repo, or whether you would like to have one. This "cloning more or less" is not about functional correctness, so it is unclear which type of minimal example you have in mind.

Collaborator

akern40 commented Feb 11, 2026

Ah thank you for the clarification. A "regression" is usually meant as a drop in performance or correctness when compared to a previous version, so I thought this was behavior you were seeing after an upgrade.

Unfortunately, performance is a known issue when using dynamic-dimensioned arrays. I haven't had a chance yet to look into the cause, but it's possible that this fix would help. I'd be happy to follow up and test to see if it improves speed. This is where a minimal example would help: the smallest possible code that exhibits the slow behavior that you're seeing.

There are some limited benchmarks in the library, but nothing comprehensive.

Collaborator

akern40 commented Feb 11, 2026

These changes look good, I appreciate the contribution! I'll put together a small benchmark to ensure that it has a positive impact on performance (I'm 99% sure that it will) and then I will merge the changes.

Comment on lines 199 to 201
Collaborator

Why can't this function call next_for_mut now?

let mut index = index;
self.next_for_mut(&mut index)

Author

Here is a small example. The perf change does not seem to be dramatic though.

#[test]
fn test_iter_perf()
{
    let a = ArrayD::from_shape_vec(
        vec![
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2,
        ],
        vec![3.0; 1024 * 1024 * 16],
    )
    .unwrap();
    for _x in a.reversed_axes().iter() {}
}

before:

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 7.02s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 6.90s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 6.89s

after:

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.60s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.40s

~/ndarray$ cargo test -- --nocapture test_iter_perf 2> /dev/null | grep iter_perf -A 3
test test_iter_perf ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 5.39s

A "regression" is usually meant as a drop in performance or correctness when compared to a previous version

Sorry for the confusion. My bad. /o\

Collaborator

@scho-furiosa Can you please fix the errors?

IIUC, this PR can only make ArrayD faster; it can't have a significant impact on other arrays? Because we would be copying N usizes instead of allocating and copying a small vector?

Author

scho-furiosa commented Feb 12, 2026

Can you please fix the errors?

Thank you. I fixed the clippy errors, but I am not sure how to handle the miri-related one.

IIUC, this PR can only make ArrayD faster, it can't have a significant impact on other arrays? Because we copying N usize instead of allocating and copying a small vector?

Exactly, that is my understanding. The index copies in Array1-6 should have been cheap regardless of this PR, since those index types are Copy in Rust.

Collaborator

LGTM, but I'll let @akern40 run their small benchmark.

Collaborator

akern40 commented Feb 13, 2026

The last thing I want to see is the above ArrayD benchmark replicated for a fixed-rank array, say Array6. I see that there is a significant speedup from the change for ArrayD, but I want to understand what the "gap" was between ArrayD and Array6 and how that change has affected the gap.

Author

scho-furiosa commented Feb 13, 2026

but I want to understand what the "gap" was between ArrayD and Array6 and how that change has affected the gap.

Interesting point!

  • To precisely compare the iteration times only, I added a timer.
  • For a fair comparison of ArrayD and Array6, I revised the shape.
use std::time::Instant;

#[test]
fn test_iter_perfd()
{
    let a = ArrayD::from_shape_vec(vec![1024, 1024, 2, 2, 2, 2], vec![3.0; 1024 * 1024 * 16])
        .unwrap()
        .reversed_axes();

    let start = Instant::now();
    for _x in a.iter() {}
    let duration = start.elapsed();
    println!("Time elapsed is: {:?}", duration);
}

#[test]
fn test_iter_perf6()
{
    let a = Array6::from_shape_vec((1024, 1024, 2, 2, 2, 2), vec![3.0; 1024 * 1024 * 16])
        .unwrap()
        .reversed_axes();

    let start = Instant::now();
    for _x in a.iter() {}
    let duration = start.elapsed();
    println!("Time elapsed is: {:?}", duration);
}

Here is the result.

ArrayD before

Time elapsed is: 4.147015005s
Time elapsed is: 4.185004973s
Time elapsed is: 4.152269005s

Array6 before

Time elapsed is: 2.877281448s
Time elapsed is: 2.874772023s
Time elapsed is: 2.847792364s

======

ArrayD after

Time elapsed is: 2.870766284s
Time elapsed is: 2.859789082s
Time elapsed is: 2.862473005s


Array6 after

Time elapsed is: 2.761921391s
Time elapsed is: 2.771982213s
Time elapsed is: 2.749912549s
  • A significant speed up in ArrayD, about -31%.
  • A minor speed up in Array6, about -3%.
  • The gap between ArrayD and Array6 was x1.45 and now is x1.04 with this PR.
